Learning in A Changing World: Restless Multi-Armed Bandit with Unknown Dynamics

Haoyang Liu, Keqin Liu, Qing Zhao 
University of California, Davis, CA 95616 
{liu, kqliu, qzhao}@ucdavis.edu

Abstract 

We consider the restless multi-armed bandit (RMAB) problem with unknown dynamics in which a 
player chooses M out of N arms to play at each time. The reward state of each arm transits according 
to an unknown Markovian rule when it is played and evolves according to an arbitrary unknown random 
process when it is passive. The performance of an arm selection policy is measured by regret, defined 
as the reward loss with respect to the case where the player knows which M arms are the most 
rewarding and always plays the M best arms. We construct a policy with an interleaving exploration and 
exploitation epoch structure that achieves a regret with logarithmic order when arbitrary (but nontrivial) 
bounds on certain system parameters are known. When no knowledge about the system is available, we 
show that the proposed policy achieves a regret arbitrarily close to the logarithmic order. We further 
extend the problem to a decentralized setting where multiple distributed players share the arms without 
information exchange. Under both an exogenous restless model and an endogenous restless model, 
we show that a decentralized extension of the proposed policy preserves the logarithmic regret order 
as in the centralized setting. The results apply to adaptive learning in various dynamic systems and 
communication networks, as well as financial investment. 

I. Introduction 

A. Multi-Armed Bandit with i.i.d. and Rested Markovian Reward Models 

In the classic multi-armed bandit (MAB) with an i.i.d. reward model, there are N independent 
arms and a single player. Each arm, when played, offers an i.i.d. random reward drawn from 
a distribution with unknown mean. At each time, the player chooses one arm to play, aiming
to maximize the total expected reward in the long run. This problem involves the well-known 
tradeoff between exploitation and exploration, where the player faces the conflicting objectives 
of playing the arm with the best reward history and playing a less explored arm to learn its 
reward statistics. 

A commonly used performance measure of an arm selection policy is the so-called regret or 
the cost of learning, defined as the reward loss with respect to the case with a known reward 
model. It is clear that under a known reward model, the player should always play the arm with 
the largest reward mean. The essence of the problem is thus to identify the best arm without 
engaging other arms too often. Any policy with a sublinear growth rate of regret achieves the 
same maximum average reward (given by the largest reward mean) as in the known model case. 
However, the slower the regret growth rate, the faster the convergence to this maximum average 
reward, indicating a more effective learning ability of the policy. 

In 1985, Lai and Robbins showed that regret grows at least at a logarithmic order with time,
and an optimal policy was explicitly constructed to achieve the minimum regret growth rate for
several reward distributions including Bernoulli, Poisson, Gaussian, and Laplace [1]. Several other
policies have been developed under different assumptions on the reward distribution [2], [3].
In particular, an index policy referred to as Upper Confidence Bound 1 (UCB1), proposed by
Auer et al. in [3], achieves logarithmic regret for any reward distribution with finite support.
In [4], Liu and Zhao proposed a policy that achieves the optimal logarithmic regret order for
a more general class of reward distributions and sublinear regret orders for heavy-tailed reward
distributions.

In 1987, Anantharam et al. extended Lai and Robbins's results to a Markovian reward model
where the reward state of each arm evolves as an unknown Markov process over successive plays
and remains frozen when the arm is passive (the so-called rested Markovian reward model) [5].
In [6], Tekin and Liu extended the UCB1 policy proposed in [3] to the rested Markovian reward
model.

B. Restless Multi-Armed Bandit with Unknown Dynamics 

In this paper, we consider Restless Multi-Armed Bandit (RMAB), a generalization of the 
classic MAB. In contrast to the rested Markovian reward model, in RMAB, the state of each arm 
continues to evolve even when it is not played. More specifically, the state of each arm changes
according to an unknown Markovian transition rule when the arm is played and according to an 
arbitrary unknown random process when the arm is not played. We consider both the centralized 
(or equivalently, the single-player) setting and the decentralized setting with multiple distributed 
players. 

1) Centralized Setting: A centralized setting where M players share their observations and
make arm selections jointly is equivalent to a single player who chooses and plays M arms 
simultaneously. The performance measure regret is similarly defined: it is the reward loss 
compared to the case when the player knows which arms are the most rewarding and always 
plays the M best arms. 

Compared to the i.i.d. and the rested Markovian reward models, the restless nature of arm 
state evolution requires that each arm be played consecutively for a period of time in order to 
learn its Markovian reward statistics. The length of each segment of consecutive plays needs 
to be carefully controlled to avoid spending too much time on a bad arm. At the same time, 
we experience a transient each time we switch out and then back to an arm, which leads to 
potential reward loss compared to the steady-state behavior of this arm. Thus, the frequency of 
arm switching needs to be carefully bounded. 

To balance these factors, we construct a policy based on a deterministic sequencing of 
exploration and exploitation (DSEE) with an epoch structure. Specifically, the proposed policy 
partitions the time horizon into interleaving exploration and exploitation epochs with carefully 
controlled epoch lengths. During an exploration epoch, the player partitions the epoch into N 
contiguous segments, one for playing each of the N arms to learn their reward statistics. During 
an exploitation epoch, the player plays the arm with the largest sample mean (i.e., average reward 
per play) calculated from the observations obtained so far. The lengths of both the exploration 
and the exploitation epochs grow geometrically. The number of arm switchings is thus of
logarithmic order in time. The tradeoff between exploration and exploitation is balanced by
choosing the cardinality of the sequence of exploration epochs. Specifically, we show that with
an O(log t) cardinality of the exploration epochs, sufficiently accurate learning of the arm ranks
can be achieved when arbitrary (but nontrivial) bounds on certain system parameters are known,
and the DSEE policy offers a logarithmic regret order. When no knowledge about the system
is available, we can increase the cardinality of the exploration epochs by an arbitrarily small
order and achieve a regret arbitrarily close to the logarithmic order, i.e., the regret has order
f(t) log t for any increasing divergent function f(t). In both cases, the proposed policy achieves
the maximum average reward offered by the M best arms.

We point out that the definition of regret here, while similar to that used for the classic MAB,
is a weaker version of its counterpart. In the classic MAB with either i.i.d. or rested Markovian
rewards, the optimal policy under a known model is indeed to stay with the best arm in terms of
the reward mean¹. For RMAB, however, the optimal policy under a known model is no longer
given by staying with the arm with the largest reward mean. Unfortunately, even under known
Markovian dynamics, RMAB has been shown to be PSPACE-hard [7]. In this paper, we adopt
a weaker definition of regret. First introduced in [8], weak regret measures the performance of
a policy against a "partially-informed" genie who knows only which arm has the largest reward
mean instead of the complete system dynamics. This definition of regret leads to a tractable
problem, but at the same time, weaker results. Whether stronger results for a general RMAB
under an unknown model can be obtained is still open for exploration (see more discussions in
Sec. I-C on related work).

2) Decentralized Setting: In the decentralized setting, there are M distributed players. At 
each time, a player chooses one arm to play based on its local observations without information 
exchange with other players. Collisions occur when multiple players choose the same arm and 
result in reward loss. The objective is to design a decentralized policy that minimizes the regret growth
rate, where regret is defined as the performance loss with respect to the ideal case where the
players know the M best arms and are perfectly orthogonalized among these M best arms 
through centralized scheduling. 

We consider two types of restless reward models: the exogenous restless model and the
endogenous restless model. In the former, the system itself is rested: the state of an arm does
not change when the arm is not engaged. However, from each individual player's perspective,
arms are restless due to actions of other players that are unobservable and uncontrollable. Under
the endogenous restless model, the state of an arm evolves according to an arbitrary unknown
random process even when the arm is not played. Under both restless models, we extend the
proposed DSEE policy to a decentralized policy that achieves the same logarithmic regret order as
under centralized scheduling. We emphasize that the logarithmic regret order is achieved under
a complete decentralization among players. Players do not need to have synchronized global
timing; each player can construct the exploration and exploitation epoch sequences according to
its own local time.

1 Under the rested Markovian reward model, staying with the best arm (in terms of the steady-state reward mean) is optimal
up to a loss of an O(1) term resulting from the transient effect of the initial state, which may not be drawn from the stationary distribution [5].
This O(1) term, however, does not affect the order of the regret.

We point out that the result under the exogenous restless model is stronger than that under 
the endogenous restless model in the sense that the regret is indeed defined with respect to the 
optimal policy under a known reward model and centralized scheduling. This is possible due to 
the inherent rested nature of the system, which makes any orthogonal sharing of the M best
arms optimal (up to an O(1) term) under a known reward model.

C. Related Work on RMAB 

RMAB with unknown dynamics has not been studied in the literature except in two parallel
independent investigations reported in [9] and [10]; both consider only a single player. In [9],
Tekin and Liu considered the same problem and adopted the same definition of regret as in this
paper. They proposed a policy that achieves logarithmic (weak) regret when certain knowledge
about the system parameters is available [9]. Referred to as the regenerative cycle algorithm (RCA),
the policy proposed in [9] is based on the UCB1 policy proposed in [3] for the i.i.d. reward model.
The basic idea of RCA is to play an arm consecutively for a random number of times determined
by a regenerative cycle of a particular state; arms are selected based on the UCB1 index
calculated from observations obtained only inside the regenerative cycles (observations obtained
outside the regenerative cycles are not used in learning). The i.i.d. nature of the regenerative
cycles reduces the problem to the classic MAB under the i.i.d. reward model. The DSEE policy
proposed in this paper, however, has a deterministic epoch structure, and all observations are used
in learning. As shown in the simulation examples in Sec. IV, DSEE can offer better performance
than RCA since RCA may have to discard a large number of observations from learning before
the chosen arm enters a regenerative cycle defined by a particular pilot state. Note that when the
arm reward state space is large or when the chosen pilot state has a small stationary probability,
it may take a long time for the arm to hit the pilot state, and since the transition probabilities
are unknown, it is difficult to choose the pilot state for a smaller hitting time. In [10], a strict
definition of regret was adopted (i.e., the reward loss with respect to the optimal performance in
the ideal scenario with a known reward model). However, the problem can only be solved for a
special class of RMAB with 2 or 3 arms governed by stochastically identical two-state Markov
chains. For this special RMAB, the problem is tractable due to the semi-universal structure of
the optimal policy of the corresponding RMAB with known dynamics established in [11], [12].
By exploiting the simple structure of the optimal policy under known Markovian dynamics, Dai
et al. showed in [10] that a regret with an order arbitrarily close to logarithmic can be achieved
for this special RMAB.

There are also several recent developments on decentralized MAB with multiple players under
the i.i.d. reward model. In [13], Liu and Zhao proposed a Time Division Fair Sharing (TDFS)
framework which leads to a family of decentralized fair policies that achieve logarithmic regret
order under general reward distributions and observation models [13]. Under a Bernoulli reward
model, decentralized MAB was also addressed in [14], [15], where the single-player policy
UCB1 was extended to the multi-player setting.

The basic idea of deterministic sequencing of exploration and exploitation was first proposed 
in [4] under the i.i.d. reward model. To handle the restless reward model, we introduce the
epoch structure with epoch lengths carefully chosen to achieve the logarithmic regret order. The 
regret analysis also requires different techniques as compared to the i.i.d. case. Furthermore, the 
extension to the decentralized setting where different players are not required to synchronize in 
their epoch structures is highly nontrivial. 

The results presented in this paper and the related work discussed above are developed within 
the non-Bayesian framework of MAB in which the unknowns in the reward models are treated 
as deterministic quantities and the design objective is universally (over all possible values of the 
unknowns) good policies. The other line of development is within the Bayesian framework in 
which the unknowns are modeled as random variables with known prior distributions and the 
design objective is policies with good average performance (averaged over the prior distributions 
of the unknowns). By treating the posterior probabilistic knowledge (updated from the prior 
distribution using past observations) about the unknowns as the system state, Bellman in 1956
abstracted and generalized the classic Bayesian MAB to a special class of Markov decision
processes [16]. The long-standing Bayesian MAB was solved by Gittins in the 1970s, when he
established the optimality of an index policy, the so-called Gittins index policy [17]. In 1988,
Whittle generalized the classic Bayesian MAB to the restless MAB (with known Markovian
dynamics) and proposed an index policy based on a Lagrangian relaxation [18]. Weber and Weiss
in 1990 showed that the Whittle index policy is asymptotically optimal under certain conditions [19],
[20]. In the finite regime, the strong performance of the Whittle index policy has been demonstrated
in numerous examples (see, e.g., [21]-[24]).

D. Applications 

The restless multi-armed bandit problem has a broad range of potential applications. For 
example, in a cognitive radio network with dynamic spectrum access [25], a secondary user
searches among several channels for idle slots that are temporarily unused by primary users. The 
state of each channel (busy or idle) can be modeled as a two-state Markov chain with unknown 
dynamics. At each time, a secondary user chooses one channel to sense and subsequently transmit 
if the channel is found to be idle. The objective of the secondary user is to maximize the long- 
term throughput by designing an optimal channel selection policy without knowing the traffic 
dynamics of the primary users. The decentralized formulation under the endogenous restless 
model applies to a network of distributed secondary users. 

The results obtained in this paper also apply to opportunistic communication in an unknown 
fading environment. Specifically, each user senses the fading realization of a selected channel 
and chooses its transmission power or data rate accordingly. The reward can be defined to 
capture energy efficiency (for fixed-rate transmission) or throughput. The objective is to design 
the optimal channel selection policies under unknown fading dynamics. Similar problems under 
known fading models have been considered in [26]-[28].

Another potential application is financial investment, where a Venture Capital (VC) selects
one company to invest in each year. The state (e.g., annual profit) of each company evolves as
a Markov chain with the transition matrix depending on whether the company is invested in or
not [29]. The objective of the VC is to maximize the long-run profit by designing the optimal
investment strategy without knowing the market dynamics a priori. The case with multiple VCs
may fit into the decentralized formulation under the exogenous restless model.

E. Notations and Organization 

For two positive integers $k$ and $l$, define $k \oslash l = ((k - 1) \bmod l) + 1$, which is an integer taking
values in $\{1, 2, \ldots, l\}$.
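The wrap-around index ((k − 1) mod l) + 1 defined above is simply a 1-based modulo. A minimal Python sketch follows; the function name `wrap_index` is a hypothetical helper introduced here for illustration and does not appear in the paper.

```python
def wrap_index(k: int, l: int) -> int:
    """Map a positive integer k to {1, ..., l} via ((k - 1) mod l) + 1."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be positive integers")
    return ((k - 1) % l) + 1


# Indices cycle through 1, 2, 3 rather than 0, 1, 2.
print([wrap_index(k, 3) for k in range(1, 8)])  # [1, 2, 3, 1, 2, 3, 1]
```

The 1-based form matters because arms and players are numbered from 1, so a plain `k % l` would produce an invalid index 0.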

The rest of the paper is organized as follows. In Sec. II, we consider the single-player setting.
We propose the DSEE policy and establish its logarithmic regret order. In Sec. III, we consider the
decentralized setting with multiple distributed players. We present several simulation examples in
Sec. IV to compare the performance of DSEE with the policy proposed in [9]. Sec. V concludes
this paper.

II. The Centralized Setting 

In this section, we consider the centralized, or equivalently, the single-player setting. We first 
present the problem formulation and the definition of regret and then propose the DSEE policy 
and establish its logarithmic regret order. 

A. Problem Formulation 

In the centralized setting, we have one player and N independent arms. At each time, the
player chooses M arms to play. Each arm, when played, offers a certain amount of reward that
defines the current state of the arm. Let $s_j(t)$ and $\mathcal{S}_j$ denote, respectively, the state of arm $j$ at
time $t$ and the state space of arm $j$. When arm $j$ is played, its state changes according to an
unknown Markovian rule with $P_j$ as the transition matrix. The transition matrices are assumed to
be irreducible, aperiodic, and reversible. States of passive arms transit according to an arbitrary
unknown random process. Let $\pi_j = \{\pi_j(s)\}_{s \in \mathcal{S}_j}$ denote the stationary distribution of arm $j$ under
$P_j$. The stationary reward mean $\mu_j$ is given by $\mu_j = \sum_{s \in \mathcal{S}_j} s\,\pi_j(s)$. Let $\sigma$ be a permutation of
$\{1, \ldots, N\}$ such that

$$\mu_{\sigma(1)} \ge \mu_{\sigma(2)} \ge \cdots \ge \mu_{\sigma(N)}.$$

A policy $\Phi$ is a rule that specifies an arm to play based on the observation history. Let $t_j(n)$
denote the time index of the $n$th play on arm $j$, and $T_j(t)$ the total number of plays on arm $j$
by time $t$. Notice that both $t_j(n)$ and $T_j(t)$ are random variables with distributions determined
by the policy $\Phi$. The total reward under $\Phi$ by time $t$ is given by

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n)). \tag{1}$$

The performance of a policy $\Phi$ is measured by regret $r_\Phi(t)$, defined as the reward loss with
respect to the best possible single-arm policy:

$$r_\Phi(t) = t \sum_{j=1}^{M} \mu_{\sigma(j)} - \mathbb{E}_\Phi[R(t)] + O(1), \tag{2}$$

where $R(t)$ is the total reward in (1), the $O(1)$ constant term is caused by the transient effect of playing the M best arms when
their initial states are not given by the stationary distribution, and $\mathbb{E}_\Phi$ denotes the expectation with
respect to the random process induced by policy $\Phi$. The objective is to minimize the growth
rate of the regret with time $t$. Note that the constant term does not affect the order of the regret
and will be omitted in the regret analysis in subsequent sections.
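To make the quantities in this formulation concrete, the following sketch computes a stationary distribution, the stationary reward mean of an arm, and the weak-regret benchmark with the O(1) term dropped. It is an illustration under stated assumptions, not part of the paper: the two transition matrices, the function names, and the use of plain power iteration are all hypothetical choices, and `collected` stands in for the expected total reward (in practice an average over many simulated runs).

```python
from typing import List, Sequence


def stationary_distribution(P: Sequence[Sequence[float]], iters: int = 2000) -> List[float]:
    """Approximate pi = pi P by power iteration (chain assumed irreducible and aperiodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def stationary_mean(states: Sequence[float], P: Sequence[Sequence[float]]) -> float:
    """mu_j = sum over s of s * pi_j(s); the reward equals the state value."""
    return sum(s * p for s, p in zip(states, stationary_distribution(P)))


def weak_regret(t: int, means: Sequence[float], M: int, collected: float) -> float:
    """t * (sum of the M largest stationary means) - collected reward; O(1) term dropped."""
    return t * sum(sorted(means, reverse=True)[:M]) - collected


# Hypothetical two-state arms with states (rewards) 0 and 1.
P_a = [[0.9, 0.1], [0.2, 0.8]]   # stationary distribution (2/3, 1/3)
P_b = [[0.5, 0.5], [0.5, 0.5]]   # stationary distribution (1/2, 1/2)
means = [stationary_mean([0, 1], P_a), stationary_mean([0, 1], P_b)]
print(means)
print(weak_regret(t=100, means=means, M=1, collected=40.0))
```

Note that the benchmark only needs the ranking of the stationary means, which is what makes this regret "weak": the genie does not exploit the transition structure itself.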

In the remainder of this section, we first consider M = 1. The extension to the general case
is given in Sec. II-D.

B. DSEE with An Epoch Structure 

Compared to the i.i.d. and the rested Markovian reward models, the restless nature of arm state 
evolution requires that each arm be played consecutively for a period of time in order to learn 
its Markovian reward statistics and to approach the steady state. The length of each segment of 
consecutive plays needs to be carefully controlled: it should be short enough to avoid spending 
too much time on a bad arm and, at the same time, long enough to limit the transient effect. 
To balance these factors, we construct a policy based on DSEE with an epoch structure. As 
illustrated in Fig. 1, the proposed policy partitions the time horizon into interleaving exploration
and exploitation epochs with geometrically growing epoch lengths. In the exploitation epochs, 
the player computes the sample mean (i.e., average reward per play) of each arm based on 
the observations obtained so far and plays the arm with the largest sample mean, which can 
be considered as the current estimated best arm. In the exploration epochs, the player aims to 
learn the reward statistics of all arms by playing them equally many times. The purpose of the 
exploration epochs is to make decisions in the exploitation epochs sufficiently accurate. 

As illustrated in Fig. 1, in the $n$th exploration epoch, the player plays every arm $4^{n-1}$ times.
In the $n$th exploitation epoch with length $2 \times 4^{n-1}$, the player plays the arm with the largest
sample mean (denoted as arm $a^*$) determined at the beginning of this epoch. At the end of each
epoch, whether to start an exploitation epoch or an exploration epoch is determined by whether
sufficiently many (specifically, $D \log t$ as given in (3) in Fig. 2) observations have been obtained
from every arm in the exploration epochs. This condition ensures that only logarithmically many
plays are spent in the exploration epochs, which is necessary for achieving the logarithmic regret
order. This also implies that the exploration epochs are much less frequent than the exploitation
epochs. Though the exploration epochs can be understood as the "information gathering" phase,
and the exploitation epochs as the "information utilization" phase, observations obtained in the
exploitation epochs are also used in learning the arm reward statistics. A complete description
of the proposed policy is given in Fig. 2.

[Figure 1: an exploitation epoch ("Play the best arm $a^*$", $2 \times 4^{n-1}$ times) and an exploration epoch divided into $N$ segments ("Arm 1", ..., "Arm N", $4^{n-1}$ times each).]

Fig. 1. The epoch structure with geometrically growing epoch lengths.
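The interleaving of epochs described in this subsection can be sketched as a small generator for the case M = 1. This is an illustrative reading rather than the authors' code: the name `dsee_epochs` is hypothetical, `log` is taken to be the natural logarithm, and the epoch counters are read so that the nth epoch of each type has the length stated in the text ($4^{n-1}$ plays per arm, or $2 \times 4^{n-1}$ plays of the estimated best arm).

```python
import math
from typing import Iterator, Tuple


def dsee_epochs(N: int, D: float, horizon: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (kind, start_time, length) for every epoch that starts by `horizon`."""
    t, n_explore, n_exploit = 1, 0, 0
    while t <= horizon:
        # Plays spent on each arm in exploration epochs so far: 1 + 4 + ... + 4^(n-1).
        explored_per_arm = (4 ** n_explore - 1) / 3
        if explored_per_arm > D * math.log(t):
            kind, length = "exploit", 2 * 4 ** n_exploit
            n_exploit += 1
        else:
            kind, length = "explore", N * 4 ** n_explore
            n_explore += 1
        yield kind, t, length
        t += length


for epoch in dsee_epochs(N=3, D=1.0, horizon=100):
    print(epoch)
```

With these hypothetical values the schedule opens with two exploration epochs and then switches to exploitation epochs of lengths 2, 8, 32, and so on, until the exploration test fails again at a much larger t.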



C. Regret Analysis 

In this section, we show that the proposed policy achieves a logarithmic regret order. This is 
given in the following theorem. 

Theorem 1: Assume that $\{P_i\}_{i=1}^{N}$ are finite state, irreducible, aperiodic and reversible. All the
reward states are non-negative. Let $\epsilon_i$ be the second largest eigenvalue of $P_i$. Define
$\epsilon_{\min} = \min_{1 \le i \le N} \epsilon_i$, $\pi_{\min} = \min_{1 \le i \le N,\, s \in \mathcal{S}_i} \pi_i(s)$, $r_{\max} = \max_{1 \le i \le N} \sum_{s \in \mathcal{S}_i} s$, $|\mathcal{S}|_{\max} = \max_{1 \le i \le N} |\mathcal{S}_i|$,
$A_{\max} = \max_i \left(\min_{s \in \mathcal{S}_i} \pi_i(s)\right)^{-1} \sum_{s \in \mathcal{S}_i} s$, and $L = \frac{30\, r_{\max}^2}{(3 - 2\sqrt{2})\,\epsilon_{\min}}$. Assume that the best arm has a
distinct reward mean². Set the policy parameter $D$ to satisfy the following condition:

$$D > \frac{4L}{(\mu_{\sigma(1)} - \mu_{\sigma(2)})^2}. \tag{4}$$

The regret of DSEE at the end of any epoch can be upper bounded by

$$r_\Phi(t) \le C_1 \left\lceil \log_4\!\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil + C_2 \left[4(3D \log t + 1) - 1\right] + N A_{\max} \left(\lfloor \log_4(3D \log t + 1) \rfloor + 1\right), \tag{5}$$

where $C_1$ is the constant defined in (6) [definition not recoverable from the source] and

$$C_2 = \frac{1}{3}\left(N \mu_{\sigma(1)} - \sum_{i=1}^{N} \mu_{\sigma(i)}\right). \tag{7}$$

Proof: See Appendix A for details. ■

2 The extension to the general case is straightforward.

DSEE with An Epoch Structure

Time is divided into exploration and exploitation epochs. Let $n_O(t)$ and $n_I(t)$ denote,
respectively, the numbers of exploration and exploitation epochs up to time $t$.

1. At $t = 1$, the player starts the first exploration epoch with length $N$, in which every arm
is played once. Set $n_O(N + 1) = 1$, $n_I(N + 1) = 0$. Then go to Step 2.

2. Let $X_O(t) = (4^{n_O} - 1)/3$ be the time spent on each arm in exploration epochs by time
$t$. Choose $D$ according to (4). If

$$X_O(t) > D \log t, \tag{3}$$

go to Step 3. Otherwise, go to Step 4.

3. Start an exploitation epoch with length $2 \times 4^{n_I - 1}$. Calculate the sample mean $\bar{s}_j(t)$ of
each arm. Play the arm with the largest sample mean. Increase $n_I$ by one. Go to Step 2.

4. Start an exploration epoch with length $N 4^{n_O - 1}$. Play each arm $4^{n_O - 1}$ times. Increase
$n_O$ by one. Go to Step 2.

Fig. 2. DSEE with an epoch structure for RMAB.
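The policy of Fig. 2 can be exercised on a toy problem. The sketch below is a minimal simulation for M = 1 under several assumptions that are not from the paper: the arms are hypothetical two-state chains, a passive arm is assumed to evolve under the same chain as when it is played (the model allows an arbitrary passive process), `log` is the natural logarithm, D is picked by hand rather than from (4), and the counters are read so that the nth epoch of each type has the length given in Sec. II-B. All function and variable names are illustrative.

```python
import math
import random
from typing import List, Tuple


def step(state: int, P: List[List[float]], rng: random.Random) -> int:
    """Sample the next state of a finite Markov chain with transition matrix P."""
    u, acc = rng.random(), 0.0
    for nxt, p in enumerate(P[state]):
        acc += p
        if u < acc:
            return nxt
    return len(P[state]) - 1


def run_dsee(chains, rewards, D: float, horizon: int, seed: int = 0) -> Tuple[float, List[int]]:
    """Simulate DSEE with M = 1; return (total reward, number of plays per arm)."""
    rng = random.Random(seed)
    N = len(chains)
    states = [0] * N
    sums = [0.0] * N   # reward observed on each arm, over all epochs
    plays = [0] * N
    n_explore = n_exploit = 0
    total, t = 0.0, 1

    def play(arm: int, count: int) -> None:
        nonlocal total, t
        for _ in range(count):
            if t > horizon:
                return
            # Restless assumption: every arm moves in every slot, played or not.
            for j in range(N):
                states[j] = step(states[j], chains[j], rng)
            r = rewards[arm][states[arm]]
            sums[arm] += r
            plays[arm] += 1
            total += r
            t += 1

    while t <= horizon:
        explored_per_arm = (4 ** n_explore - 1) / 3
        if explored_per_arm > D * math.log(t):
            # Exploitation: the best arm is fixed at the start of the epoch.
            best = max(range(N), key=lambda j: sums[j] / plays[j])
            play(best, 2 * 4 ** n_exploit)
            n_exploit += 1
        else:
            # Exploration: one contiguous segment per arm.
            for arm in range(N):
                play(arm, 4 ** n_explore)
            n_explore += 1
    return total, plays


chains = [[[0.7, 0.3], [0.6, 0.4]],   # stationary mean reward 1/3
          [[0.3, 0.7], [0.1, 0.9]]]   # stationary mean reward 0.875
rewards = [[0.0, 1.0], [0.0, 1.0]]
total, plays = run_dsee(chains, rewards, D=2.0, horizon=5000)
print(total, plays)
```

Sample means here use every observation, including those from exploitation epochs, which matches the remark in Sec. II-B that exploitation observations also feed the learning.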

In the proposed policy, to ensure the logarithmic regret order, the policy parameter $D$ needs
to be chosen appropriately. This requires an arbitrary (but nontrivial) bound on $r_{\max}$, $\epsilon_{\min}$, and
$\mu_{\sigma(1)} - \mu_{\sigma(2)}$. In the case where no knowledge about the system is available, $D$ can be chosen to
increase with time rather than set a priori to achieve a regret order arbitrarily close to logarithmic.
This is formally stated in the following theorem.
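As a rough sanity check on where the logarithmic factors come from (a sketch only, not the argument of Appendix A), the epoch counts follow directly from the geometric epoch lengths of Sec. II-B for M = 1. Here $n_O$ and $n_I$ denote the numbers of exploration and exploitation epochs completed by an end-of-epoch time $t$, and $D > 0$ is assumed fixed.

```latex
% Exploitation epochs: the n-th one has length 2 * 4^{n-1}, and the first
% exploration epoch already occupies N slots.
\sum_{n=1}^{n_I} 2 \cdot 4^{n-1} = \tfrac{2}{3}\,\bigl(4^{n_I} - 1\bigr) \le t - N
\;\Longrightarrow\;
n_I \le \log_4\!\Bigl(\tfrac{3}{2}(t - N) + 1\Bigr).

% Exploration epochs: the last one was started at some t' <= t only because
% the test X_O(t') > D log t' failed with n_O - 1 exploration epochs completed.
\tfrac{1}{3}\,\bigl(4^{n_O - 1} - 1\bigr) \le D \log t' \le D \log t
\;\Longrightarrow\;
n_O \le \lfloor \log_4(3D \log t + 1) \rfloor + 1,
\qquad
\tfrac{1}{3}\,\bigl(4^{n_O} - 1\bigr) \le \tfrac{1}{3}\,\bigl[\,4(3D \log t + 1) - 1\,\bigr].
```

Both epoch counts, and hence the number of arm switches, therefore grow at most logarithmically in $t$, while the time spent exploring each arm is of order $\log t$.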






Theorem 2: Assume that $\{P_i\}_{i=1}^{N}$ are finite state, irreducible, aperiodic and reversible. All the
reward states are non-negative. For any increasing sequence $f(t)$ ($f(t) \to \infty$ as $t \to \infty$), if
$D(t) = f(t)$, then

$$r_\Phi(t) \sim O(f(t) \log t). \tag{8}$$

Proof: See Appendix B for details. ■
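In code, the parameter-free variant only changes the exploration test: the constant D is replaced by a function of time. The sketch below is illustrative; `start_exploration` and `slow_f` are hypothetical names, `log` is the natural logarithm, and the particular f (a shifted log log) is just one example of an increasing, unbounded sequence and is not taken from the paper.

```python
import math
from typing import Callable


def start_exploration(n_explore: int, t: int, D_of_t: Callable[[int], float]) -> bool:
    """True if the next epoch should be an exploration epoch when D(t) = f(t)."""
    explored_per_arm = (4 ** n_explore - 1) / 3
    return explored_per_arm <= D_of_t(t) * math.log(t)


def slow_f(t: int) -> float:
    """One hypothetical choice of f: increasing, unbounded, and very slowly growing."""
    return math.log(math.log(t + math.e))


print(start_exploration(0, 1, slow_f))     # first epoch is always exploration
print(start_exploration(3, 10, slow_f))    # 21 plays per arm already exceed f(10) log 10
print(start_exploration(1, 1000, slow_f))  # one play per arm is too few by t = 1000
```

The slower f grows, the closer the regret order f(t) log t is to logarithmic, at the price of a longer wait before the exploration budget exceeds the unknown threshold on D.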

D. Extension to M > 1 

For $M > 1$, the basic structure of DSEE is the same. The only difference is that in the $n$th
exploitation epoch with length $2 \times 4^{n-1}$, the player plays the arms with the $M$ largest sample
means; in the $n$th exploration epoch with length $\lceil N/M \rceil 4^{n-1}$, the player spends $4^{n-1}$ plays on each
arm and gives up $(M \lceil N/M \rceil - N)\, 4^{n-1}$ plays. The regret in this case is given in the following
theorem.
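The bookkeeping for the exploration epochs when M > 1 is easy to get wrong when M does not divide N, so a small worked computation may help. The helper name `exploration_epoch_m` is hypothetical and introduced only for this illustration.

```python
from typing import Tuple


def exploration_epoch_m(N: int, M: int, n: int) -> Tuple[int, int, int]:
    """Return (epoch length, plays per arm, unused plays) for the n-th exploration epoch."""
    if not (1 <= M <= N) or n < 1:
        raise ValueError("need 1 <= M <= N and n >= 1")
    per_arm = 4 ** (n - 1)
    rounds = -(-N // M)                  # ceil(N / M) using integer arithmetic
    length = rounds * per_arm            # time slots in the epoch; M plays per slot
    unused = (M * rounds - N) * per_arm  # play opportunities that are given up
    return length, per_arm, unused


print(exploration_epoch_m(N=5, M=2, n=2))  # (12, 4, 4)
```

In the example, 12 slots with 2 plays each give 24 play opportunities, of which 5 × 4 = 20 are used and 4 are given up.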

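Theorem 3: Assume that $\{P_i\}_{i=1}^{N}$ are finite state, irreducible, aperiodic and reversible. All the
reward states are non-negative. Let $\epsilon_i$ be the second largest eigenvalue of $P_i$. Define
$\epsilon_{\min} = \min_{1 \le i \le N} \epsilon_i$, $\pi_{\min} = \min_{1 \le i \le N,\, s \in \mathcal{S}_i} \pi_i(s)$, $r_{\max} = \max_{1 \le i \le N} \sum_{s \in \mathcal{S}_i} s$, $|\mathcal{S}|_{\max} = \max_{1 \le i \le N} |\mathcal{S}_i|$,
$A_{\max} = \max_i \left(\min_{s \in \mathcal{S}_i} \pi_i(s)\right)^{-1} \sum_{s \in \mathcal{S}_i} s$, and $L = \frac{30\, r_{\max}^2}{(3 - 2\sqrt{2})\,\epsilon_{\min}}$. Assume that the $M$th best arm
has a distinct reward mean³. Set the policy parameter $D$ to satisfy the following condition:

$$D > \frac{4L}{(\mu_{\sigma(M)} - \mu_{\sigma(M+1)})^2}. \tag{9}$$

The regret of DSEE at the end of any epoch can be upper bounded by

$$r_\Phi(t) \le C_1 \left\lceil \log_4\!\left(\tfrac{3}{2}(t - N) + 1\right) \right\rceil + C_2 \left[4(3D \log t + 1) - 1\right] + N A_{\max} \left(\lfloor \log_4(3D \log t + 1) \rfloor + 1\right), \tag{10}$$

where $C_1$ is the constant defined in (11), a sum over $j = 1, \ldots, M$, $i = M+1, \ldots, N$, and $k = j, i$ [summand not recoverable from the source], and

$$C_2 = \frac{1}{3}\left(\left\lceil \frac{N}{M} \right\rceil \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_{\sigma(i)}\right). \tag{12}$$

Proof: See Appendix C for details.

3 The extension to the general case is straightforward.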



The regret at any time $t$ has an upper bound with a logarithmic order similar to (10), with $t$ replaced
by $4t + 3$. In the proposed policy, to ensure the logarithmic regret order, the policy parameter
$D$ needs to be chosen appropriately. This requires an arbitrary (but nontrivial) bound on $r_{\max}$,
$\epsilon_{\min}$, and $\mu_{\sigma(M)} - \mu_{\sigma(M+1)}$. In the case where no knowledge about the system is available, $D$
can be chosen to increase with time rather than set a priori to achieve a regret order arbitrarily
close to logarithmic. This is formally stated in the following theorem.

Theorem 4: Assume that $\{P_i\}_{i=1}^{N}$ are finite state, irreducible, aperiodic and reversible. All the
reward states are non-negative. For any increasing sequence $f(t)$ ($f(t) \to \infty$ as $t \to \infty$), if
$D(t) = f(t)$, then

$$r_\Phi(t) \sim O(f(t) \log t). \tag{13}$$

Proof: See Appendix B for details.

III. The Decentralized Setting

A. Problem Formulation

In the decentralized setting, there are M players and N independent arms (N > M). At each
time, each player chooses one arm to play based on its local observations. As in the single-player
case, the reward state of arm $j$ changes according to a Markovian rule when played, and the
same set of notations is adopted. For the state transition of a passive arm, we consider two
models: the endogenous restless model and the exogenous restless model. In the former, the arm
evolves according to an arbitrary unknown random process even when it is not played. In the
latter, the system itself is rested. From each individual player's perspective, however, arms are
restless due to actions of other players that are unobservable and uncontrollable. The players do
not know the arm dynamics and do not communicate with each other. Collisions occur when
multiple players select the same arm to play. Different collision models can be adopted, where
the players in conflict can share the reward or no one receives any reward. In both cases, the
total reward under a policy $\Phi$ by time $t$ can be written as

$$R(t) = \sum_{j=1}^{N} \sum_{n=1}^{T_j(t)} s_j(t_j(n))\, I_j(t_j(n)), \tag{14}$$

where, when conflicting players share the reward, $I_j(t_j(n)) = 1$ if arm $j$ is played by at least
one player at time $t_j(n)$, and $I_j(t_j(n)) = 0$ otherwise; when conflicting players get no
reward, $I_j(t_j(n)) = 1$ if arm $j$ is played by one and only one player at time $t_j(n)$, and $I_j(t_j(n)) = 0$
otherwise.
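The two collision models differ only in how the indicator is computed from the players' simultaneous choices. A minimal sketch for a single time slot follows; the function name `collision_indicators` and the example choices are hypothetical, and arms are numbered from 1 as in the text.

```python
from collections import Counter
from typing import List, Sequence


def collision_indicators(choices: Sequence[int], N: int, share_reward: bool) -> List[int]:
    """Return [I_1, ..., I_N] for one slot, given each player's chosen arm (1-based)."""
    counts = Counter(choices)
    if share_reward:
        # Colliding players share the reward: the arm still yields its reward once.
        return [1 if counts[j] >= 1 else 0 for j in range(1, N + 1)]
    # Colliding players get nothing: only arms with exactly one player count.
    return [1 if counts[j] == 1 else 0 for j in range(1, N + 1)]


choices = [1, 3, 3]  # hypothetical slot: players 2 and 3 collide on arm 3
print(collision_indicators(choices, N=4, share_reward=True))   # [1, 0, 1, 0]
print(collision_indicators(choices, N=4, share_reward=False))  # [1, 0, 0, 0]
```

Under either model a collision wastes at least one play opportunity, which is why the benchmark assumes the players are perfectly orthogonalized.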

Under both restless models and both collision models, regret $r_\Phi(t)$ is defined as the reward
loss with respect to the ideal scenario of a perfect orthogonalization of the M players over the
M best arms. We thus have

$$r_\Phi(t) = t \sum_{i=1}^{M} \mu_{\sigma(i)} - \mathbb{E}_\Phi[R(t)] + O(1), \tag{15}$$

where $R(t)$ is the total reward in (14) and the $O(1)$ constant term comes from the transient effect of the M best arms (similar to
the single-player setting). Note that under the exogenous restless model, this definition of regret
is strict in the sense that $t \sum_{i=1}^{M} \mu_{\sigma(i)} + O(1)$ is indeed the maximal expected reward achievable
under a known model of the arm dynamics.

B. Decentralized DSEE Policy

For ease of presentation, we first assume that the players are synchronized according
to a global time. Since the epoch structure of DSEE is deterministic, global timing ensures
synchronized exploration and exploitation among players. We further assume that the players
have pre-agreement on the time offset for sharing the arms, determined based on, for example, the
players' IDs. We will show in Sec. III-D that this requirement on global timing and pre-agreement
can be eliminated to achieve a complete decentralization.

The decentralized DSEE has a similar epoch structure. In the exploration epochs (with the $n$th
one having length $N \times 4^{n-1}$), the players play all $N$ arms in a round-robin fashion with different
offsets determined in the pre-agreement. In the exploitation epochs, each player calculates the
sample mean of every arm based on its own local observations and plays the arms with the $M$
largest sample means in a round-robin fashion with a certain offset. Note that even though the
players have different time-sharing offsets, collisions may still occur during exploitation epochs since the
players may arrive at different sets and ranks of the $M$ arms due to the randomness in their
local observations. Each of these $M$ arms is played $2 \times 4^{n-1}$ times. The $n$th exploitation epoch
thus has length $2M \times 4^{n-1}$. A detailed description of the decentralized DSEE policy is given in
Fig. 3.
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Decentralized DSEE 

Time is divided into exploration and exploitation epochs with $n_O(t)$ and $n_I(t)$ similarly defined
as in Fig. 2.

1. At t = 1, each player starts the first exploration epoch with length N. Player k plays
arm $(k + t) \oslash N$ at time t. Set $n_O(N + 1) = 1$, $n_I(N + 1) = 0$. Then go to Step 2.

2. Let $X_O(t) = (4^{n_O} - 1)/3$ be the time spent on each arm in exploration epochs by time
t. Choose D according to (17). If

\[ X_O(t) > D \log t, \tag{16} \]

go to Step 3. Otherwise, go to Step 4.

3. Start an exploitation epoch with length $2M \times 4^{n_I - 1}$. Calculate the sample mean $\bar{s}_i(t)$ of each
arm and denote the arms with the M largest sample means as arm $a^*_1$ to arm $a^*_M$. Each
exploitation epoch is divided into M subepochs, each having a length of $2 \times 4^{n_I - 1}$.
Player k plays arm $a^*_{(k+m) \oslash M}$ in the mth subepoch. Increase $n_I$ by one. Go to Step 2.

4. Start an exploration epoch with length $N \times 4^{n_O - 1}$. Each exploration epoch is divided
into N subepochs, each having a length of $4^{n_O - 1}$. Player k plays arm $a_{(m+k) \oslash N}$ in
the mth subepoch. Increase $n_O$ by one. Go to Step 2.

Fig. 3. Decentralized DSEE policy for RMAB.
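
The epoch structure of Fig. 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`dsee_epochs`, `explore_arm`, `exploit_arm`) are hypothetical, arms, players and subepochs are indexed from 0, the round-robin offset is taken to be an ordinary modulo, and the nth exploration (exploitation) epoch is given length $N \times 4^{n-1}$ ($2M \times 4^{n-1}$) as in the prose description of the policy.

```python
import math

def dsee_epochs(N, M, D, horizon):
    """List (kind, start_time, length) for every epoch that starts by `horizon`.

    Assumption: counters are incremented before the length is computed, so the
    n-th exploration epoch has length N * 4**(n-1) and the n-th exploitation
    epoch has length 2 * M * 4**(n-1).
    """
    epochs, t, n_O, n_I = [], 1, 0, 0
    while t <= horizon:
        x_O = (4 ** n_O - 1) / 3  # exploration time already spent on each arm
        if x_O > D * math.log(t):  # enough exploration: exploit (test (16))
            n_I += 1
            kind, length = "exploit", 2 * M * 4 ** (n_I - 1)
        else:  # otherwise explore again
            n_O += 1
            kind, length = "explore", N * 4 ** (n_O - 1)
        epochs.append((kind, t, length))
        t += length
    return epochs

def explore_arm(k, m, N):
    """Arm played by player k in the m-th exploration subepoch (0-based)."""
    return (m + k) % N

def exploit_arm(k, m, ranked):
    """Arm played by player k in the m-th exploitation subepoch.

    `ranked` is the player's own list of the M arms with the largest local
    sample means; different players may hold different lists.
    """
    return ranked[(m + k) % len(ranked)]
```

With distinct offsets k, the players never collide while exploring; in exploitation they avoid collisions only when their locally ranked lists agree, which is the source of the collisions discussed in the text.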



C. Regret Analysis 

In this section, we show that the decentralized DSEE policy achieves the same logarithmic 
regret order as in the centralized setting. 

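Theorem 5: Under the same notations and definitions as in Theorem 1, assume that different
arms have different mean values$^4$. Set the policy parameter D to satisfy the following condition:

\[ D \ge \frac{4L}{\left(\min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)})\right)^2}. \tag{17} \]

Under the sharing reward conflict model, the regret of the decentralized DSEE at the end of any
epoch can be upper bounded by

\[ r_\Phi(t) \le C_1 \left\lceil \log_4\!\left(\tfrac{3t}{2M} + 1\right) \right\rceil + C_2 \left(\lfloor \log_4(3D \log t + 1) \rfloor + 1\right) + C_3 \left[4(3D \log t + 1) - 1\right], \tag{18} \]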



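where

\[ C_1 = \begin{cases}
3M \sum_{m=1}^{M} \mu_{\sigma(m)} \frac{1}{\pi_{\min}} \sum_{j=1}^{M} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\,\epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k| + M^2 A_{\max}, & \text{Endogenous restless and zero-reward collision model} \\
\frac{3M}{\pi_{\min}} \sum_{j=1}^{M} \mu_{\sigma(j)} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\,\epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k| + M^2 A_{\max}, & \text{Endogenous restless and partial-reward collision model} \\
3M \sum_{m=1}^{M} \mu_{\sigma(m)} \frac{1}{\pi_{\min}} \sum_{j=1}^{M} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\,\epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|, & \text{Exogenous restless and zero-reward collision model} \\
\frac{3M}{\pi_{\min}} \sum_{j=1}^{M} \mu_{\sigma(j)} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\,\epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|, & \text{Exogenous restless and partial-reward collision model}
\end{cases} \tag{19} \]

\[ C_2 = \begin{cases} N M A_{\max}, & \text{Endogenous restless model} \\ 0, & \text{Exogenous restless model} \end{cases} \tag{20} \]

\[ C_3 = \frac{1}{3} \left( N \sum_{i=1}^{M} \mu_{\sigma(i)} - M \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{21} \]

Proof: See Appendix D for details. ■

Achieving the logarithmic regret order requires an arbitrary (but nontrivial) bound on $r_{\max}$,
$\min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)})$, and $\epsilon_{\min}$. Similarly to the single-player case, D can be chosen to
increase with time to achieve a regret order arbitrarily close to logarithmic as stated below.

$^4$ This assumption can be relaxed when the players determine the round-robin order of the arms based on pre-agreed arm
indexes rather than the estimated arm rank. This assumption is only for simplicity of the presentation.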

Theorem 6: Under the same notations and definitions as in Theorem 1, assume that different
arms have different mean values. For any increasing sequence f(t) (f(t) → ∞ as t → ∞), if
D(t) is chosen such that D(t) = f(t), then under both the endogenous and exogenous restless
models,

\[ r_\Phi(t) \sim O(f(t) \log t). \tag{22} \]

Proof: See Appendix E for details.



D. In the Absence of Global Synchronization and Pre-agreement

In this section, we show that the requirement on global synchronization and pre-agreement 
can be eliminated while maintaining the logarithmic order of the policy. As a result, players can 
join the system at different times.

Without global timing and pre-agreement, each player has its own exploration and exploitation
epoch timing. The epoch structure of each player's local policy is similar to that given in Fig. 3.
The only difference is that in each exploitation epoch, instead of playing the top M arms (in 
terms of sample mean) in a round-robin fashion, the player randomly and uniformly chooses one 
of them to play. When a collision occurs during the exploitation epoch, the player makes another 
random and uniform selection among the top M arms. As shown in the proof of Theorem 7,
this simple adjustment based on collisions achieves efficient sharing among all players without 
global synchronization and pre-agreement. Note that during an exploration epoch, the player 
plays all N arms in a round-robin fashion without reacting to collisions. Since the players still 
observe the reward state of the chosen arm, collisions affect only the immediate reward but not 
the learning ability of each player. As a consequence, collisions during a player's exploration 
epochs will not affect the logarithmic regret order since the total length of exploration epochs 
is at the logarithmic order. The key to establishing the logarithmic regret order in the absence 
of global synchronization and pre-agreement is to show that collisions during each player's 
exploitation epochs are properly bounded and efficient sharing can be achieved. 

Theorem 7: Under the same notations and definitions as in Theorem 1, the decentralized DSEE
without global synchronization and pre-agreement achieves the logarithmic regret order.

Proof: See Appendix F for details. ■ 

The assumption that the arm reward state is still observed when collisions occur holds in 
many applications. For example, in the applications of dynamic spectrum access and opportunistic
communications under unknown fading, each user first senses the state (busy/idle or
the fading condition) of the chosen channel before a potential transmission. Channel states are 
always observed regardless of collisions. The problem is much more complex when collisions 
are unobservable and each player involved in a collision only observes its own local reward 
(which does not reflect the reward state of the chosen arm). In this case, collisions result in 
corrupted measurements that cannot be easily screened out, and learning from these corrupted 
measurements may lead to misidentified arm rank. How to achieve the logarithmic regret order 
without global timing and pre-agreement in this case is still an open problem.

IV. Simulation Results

In this section, we study the performance of DSEE as compared to the RCA policy proposed
in [6]. The first example is in the context of cognitive radio networks. We consider a
secondary user searching for idle channels unused by the primary network. Assume that the
spectrum consists of N independent channels. The state, busy (0) or idle (1), of each channel
(say, channel n) evolves as a Markov chain with transition probabilities $\{p^n_{ij}\}$, $i, j \in \{0, 1\}$. At
each time, the secondary user selects a channel to sense and chooses the transmission power
according to the channel state. The reward obtained from a transmission over channel n in state
i is given by $r^n_i$. We use the same set of parameters chosen in [6] (given in the caption of
Fig. 4). We observe from Fig. 4 that RCA initially outperforms DSEE for a short period, but
DSEE offers significantly better performance as time goes on, and the regret offered by RCA does
not seem to converge to the logarithmic order in a horizon of length $10^4$. We also note that while
the condition on the policy parameter D given in the regret analysis is sufficient for the logarithmic regret order,
it is not necessary. Fig. 4 clearly shows the convergence to the logarithmic regret order for a
small value of D, which leads to better finite-time performance.
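
To make the channel model concrete, the following sketch computes the stationary idle probability and the stationary mean reward of a two-state (busy/idle) channel. It is an illustration only: the function names are hypothetical, and the parameter vectors are read from the caption of Fig. 4 under the assumption that they are, in order, the busy-to-idle probabilities $p_{01}$, the idle-to-busy probabilities $p_{10}$, and rewards of 1 in the idle state and 0.1 in the busy state.

```python
def idle_probability(p01, p10):
    """Stationary probability of the idle state (1) of a two-state chain
    with busy->idle probability p01 and idle->busy probability p10."""
    return p01 / (p01 + p10)

def mean_reward(p01, p10, r0, r1):
    """Stationary mean reward: busy state pays r0, idle state pays r1."""
    pi1 = idle_probability(p01, p10)
    return (1 - pi1) * r0 + pi1 * r1

# Assumed reading of the Fig. 4 caption (five channels).
p01 = [0.1, 0.1, 0.5, 0.1, 0.1]
p10 = [0.2, 0.3, 0.1, 0.4, 0.5]
means = [mean_reward(a, b, 0.1, 1.0) for a, b in zip(p01, p10)]
best = max(range(len(means)), key=means.__getitem__)  # 0-based index of the best channel
```

Under this reading the third channel has the largest stationary mean reward, so it is the arm an ideal single player would always select; regret is measured against that benchmark.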

In the next example, we consider a case with a relatively large reward state space: 5 arms,
each having 20 states. The rewards from the states of arm 2 to arm 5 are
$[1, 2, \ldots, 20]$. The rewards from the states of arm 1 are $1.5 \times [1, 2, \ldots, 20]$ (to make it a better arm
than the rest). Transition probabilities of all arms were generated randomly and can be found
in Appendix G. The stationary distributions of all arms are close to uniform, which avoids the
most negative effect of randomly chosen pilot states in RCA. The values of D in DSEE and L
in RCA were chosen to be the minimum such that the ratio of the regret to log t converges to
a constant within a reasonable time horizon. We again observe a better performance from DSEE
as shown in Fig. 5.

The better performance of DSEE over RCA may come from the fact that DSEE learns from 
all observations while RCA only uses observations within the regenerative cycles in learning. 
When the arm reward state space is large or the randomly chosen pilot state that defines the 
regenerative cycle has a small stationary probability, RCA may have to discard a large number 
of observations from learning.

Fig. 4. Regret for DSEE and RCA, $p_{01}$ = [0.1, 0.1, 0.5, 0.1, 0.1], $p_{10}$ = [0.2, 0.3, 0.1, 0.4, 0.5], $r_1$ = [1, 1, 1, 1, 1], $r_0$ =
[0.1, 0.1, 0.1, 0.1, 0.1], D = 10, L = 10, 100 Monte Carlo runs.



V. Conclusion 

In this paper, we studied the restless multi-armed bandit (RMAB) problem with unknown 
dynamics under both centralized (single-player) and decentralized settings. We developed a policy 
based on a deterministic sequencing of exploration and exploitation with geometrically growing 
epochs that achieves the logarithmic regret order. In particular, in the decentralized setting with 
multiple distributed players, the proposed policy achieves a complete decentralization for both 
the exogenous and endogenous restless models. 



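Appendix A. Proof of Theorem 1 and Theorem 3
We first rewrite the definition of regret as

\[ r_\Phi(t) = t \mu_{\sigma(1)} - E_\Phi R(t) \tag{23} \]

\[ = \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\!\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] + \left[ t \mu_{\sigma(1)} - \sum_{i=1}^{N} \mu_i E[T_i(t)] \right]. \tag{24} \]

Fig. 5. Regret for DSEE and RCA with 5 arms, 20 states, L = 20, D = 1.8, 1000 Monte Carlo runs.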

To show that the regret has a logarithmic order, it is sufficient to show that the two terms in
(24) have logarithmic orders. The first term in (24) can be considered as the regret caused by
the transient effect. The second term can be considered as the regret caused by engaging a bad arm.
First, we bound the regret caused by the transient effect based on the following lemma.

Lemma 1 0: Consider an irreducible, aperiodic Markov chain with state space S, transition 
probabilities P, an initial distribution q which is positive in all states, and stationary distribution
π ($\pi_s$ is the stationary probability of state s). The state (reward) at time t is denoted by s(t).
Let μ denote the mean reward. If we play the chain for an arbitrary time T, then there exists a
value $A_P \le (\min_{s \in S} \pi_s)^{-1} \sum_{s \in S} s$ such that $\left| E\!\left[\sum_{t=1}^{T} s(t)\right] - \mu T \right| \le A_P$.

Lemma 1 shows that if the player continues to play an arm for time T, the difference between 
the expected reward and T/i can be bounded by a constant that is independent of T. This constant 
is an upper bound for the regret caused by each arm switching. If there are only logarithmically 
many arm switchings as time goes on, the regret caused by arm switching has a logarithmic order.
An upper bound on the number of arm switchings is shown below. It is developed by bounding
the numbers of the exploration epochs and the exploitation epochs respectively.

For the exploration epochs, by time t, if the player has started the (n+1)th exploration epoch,
we have

\[ \tfrac{1}{3}(4^n - 1) \le D \log t, \tag{25} \]

where $\tfrac{1}{3}(4^n - 1)$ is the time spent on each arm in the first n exploration epochs. Consequently
the number of the exploration epochs can be bounded by

\[ n_O(t) \le \lfloor \log_4(3D \log t + 1) \rfloor + 1. \tag{26} \]

By time t, at most (t − N) time slots have been spent on the exploitation epochs. Thus

\[ n_I(t) \le \left\lceil \log_4\!\left( \tfrac{3}{2}(t - N) + 1 \right) \right\rceil. \tag{27} \]
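
The exploration bounds can be checked numerically. The sketch below simulates the single-player (M = 1) epoch structure and compares the number of exploration epochs started, and the per-arm exploration time they imply, with the closed-form bounds. It is an illustrative sketch, not part of the proof; the function names are hypothetical, and the epoch lengths ($N \times 4^{n-1}$ for exploration, $2 \times 4^{n-1}$ for exploitation) are taken from the description of DSEE.

```python
import math

def exploration_counts(N, D, horizon):
    """Return (start_time, n_O) each time an exploration epoch starts (M = 1)."""
    t, n_O, n_I, out = 1, 0, 0, []
    while t <= horizon:
        if (4 ** n_O - 1) / 3 > D * math.log(t):
            n_I += 1
            t += 2 * 4 ** (n_I - 1)  # exploitation epoch
        else:
            n_O += 1
            out.append((t, n_O))
            t += N * 4 ** (n_O - 1)  # exploration epoch
    return out

def bound_num_explore(D, t):
    """Bound on the number of exploration epochs started by time t."""
    return math.floor(math.log(3 * D * math.log(t) + 1, 4)) + 1

def bound_time_explore(D, t):
    """Bound on the exploration time per arm, (1/3)[4(3 D log t + 1) - 1]."""
    return (4 * (3 * D * math.log(t) + 1) - 1) / 3
```

For example, iterating over `exploration_counts(5, 2.5, 10**6)` and comparing each recorded `n_O` with `bound_num_explore(2.5, t)` illustrates that the number of exploration epochs grows only like log log t, while the total exploration time grows like log t.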



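Hence a logarithmic upper bound of the first term in (24) is

\[ \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\!\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] \le \left( \left\lceil \log_4\!\left( \tfrac{3}{2}(t - N) + 1 \right) \right\rceil + N \left( \lfloor \log_4(3D \log t + 1) \rfloor + 1 \right) \right) A_{\max}. \tag{28} \]

Next we show that the second term of (24) has a logarithmic order by bounding the total time
spent on the bad arms. We first bound the time spent on the bad arms during the exploration
epochs. Let $T_O(t)$ denote the time spent on each arm in the exploration epochs by time t. By
(26), we have

\[ T_O(t) \le \tfrac{1}{3} \left[ 4(3D \log t + 1) - 1 \right]. \tag{29} \]

Thus the regret caused by playing bad arms in the exploration epochs is at most

\[ \tfrac{1}{3} \left[ 4(3D \log t + 1) - 1 \right] \left( N \mu_{\sigma(1)} - \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{30} \]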

Next, we bound the time spent on the bad arms during the exploitation epochs. Let $t_n$ denote
the starting point of the nth exploitation epoch. Let Pr[i, j, n] denote the probability that arm
i has a larger sample mean than arm j at $t_n$ when arm j is the best arm, i.e., Pr[i, j, n] is the
probability of making a mistake in the nth exploitation epoch. Let $w_i$ and $w_j$ denote, respectively,
the number of plays on arm i and arm j by $t_n$. Let $C_{t,w} = \sqrt{L \log t / w}$. We have

\[ \Pr[i, j, n] \le \Pr[\bar{s}_i(t_n) \ge \bar{s}_j(t_n)] \tag{31} \]

\[ \le \Pr[\bar{s}_j(t_n) \le \mu_j - C_{t_n, w_j}] + \Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n, w_i}] + \Pr[\mu_j < \mu_i + C_{t_n, w_i} + C_{t_n, w_j}] \tag{32} \]

\[ \le \Pr[\bar{s}_j(t_n) \le \mu_j - C_{t_n, w_j}] + \Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n, w_i}], \tag{33} \]

where (33) follows from the fact that $w_i \ge D \log t_n$ and $w_j \ge D \log t_n$ and the condition imposed
on D.

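Next we bound the two quantities in (33). Consider first the second term $\Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n, w_i}] = \Pr[w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L w_i \log t_n}]$. Note that the total $w_i$ plays on arm i consist
of multiple contiguous segments of the Markov sample path, each in a different epoch. Let
K denote the number of such segments. From the geometric growth of the epoch lengths, we
can see that the length of each segment is of the form $2^{k_l}$ ($l = 1, \ldots, K$) with the $k_l$'s being
distinct. Without loss of generality, let $k_1 < k_2 < \cdots < k_K$. Note that $w_i$, K, and the $k_l$'s are
random variables. The derivation below holds for every realization of these random variables.
Let $R_i(l)$ denote the total reward obtained during the lth segment. Notice that $w_i = \sum_{l=1}^{K} 2^{k_l}$
and $\sqrt{w_i} \ge \sum_{l=1}^{K} (\sqrt{2} - 1) \sqrt{2^{k_l}}$. We then have

\[ \Pr\!\left[ w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L w_i \log t_n} \right] \tag{34} \]

\[ = \Pr\!\left[ \sum_{l=1}^{K} R_i(l) - \mu_i \sum_{l=1}^{K} 2^{k_l} - \sqrt{L w_i \log t_n} \ge 0 \right] \tag{35} \]

\[ \le \Pr\!\left[ \sum_{l=1}^{K} R_i(l) - \mu_i \sum_{l=1}^{K} 2^{k_l} - \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sum_{l=1}^{K} \sqrt{2^{k_l}} \ge 0 \right] \tag{36} \]

\[ = \Pr\!\left[ \sum_{l=1}^{K} \left( R_i(l) - \mu_i 2^{k_l} - \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}} \right) \ge 0 \right] \tag{37} \]

\[ \le \Pr\!\left[ \bigcup_{l=1}^{K} \left\{ R_i(l) - \mu_i 2^{k_l} - \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}} \ge 0 \right\} \right] \tag{38} \]

\[ \le \sum_{l=1}^{K} \Pr\!\left[ R_i(l) - \mu_i 2^{k_l} \ge \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}} \right] \tag{39} \]

\[ = \sum_{l=1}^{K} \Pr\!\left[ \sum_{s \in S_i} \left( s\, O_s^i(l) - s\, 2^{k_l} \pi_s^i \right) \ge \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}} \right], \tag{40} \]

where $O_s^i(l)$ denotes the number of occurrences of state s on arm i in the lth segment. The
following Chernoff bound will be used to bound (40).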

Lemma 2 (Chernoff Bound, Theorem 2.1 in [30]): Consider a finite-state, irreducible, aperiodic
and reversible Markov chain with state space S, transition probabilities P, and an initial
distribution q. Let $N_q = \left\| \left( \frac{q_x}{\pi_x} \right), x \in S \right\|_2$. Let ε be the eigenvalue gap given by $1 - \lambda_2$, where $\lambda_2$
is the second largest eigenvalue of the matrix P. Let $A \subset S$ and $T_A(t)$ be the number of times
that states in A are visited up to time t. Then for any γ ≥ 0, we have

\[ \Pr(T_A(t) - t \pi_A \ge \gamma) \le \left( 1 + \frac{\gamma \epsilon}{10 t} \right) N_q\, e^{-\gamma^2 \epsilon / (20 t)}. \]
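
As a small worked check of how Lemma 2 is used on one segment, the sketch below evaluates the bound with segment length $t = 2^{k}$ and deviation $\gamma = (\sqrt{2} - 1)\sqrt{L \log t_n\, 2^{k}} / \sum_{s} s$. With this choice the exponential factor collapses to a pure power of $t_n$, namely $t_n^{-(3 - 2\sqrt{2}) L \epsilon / (20 (\sum_s s)^2)}$, independent of the segment length. The function names and the numerical values are illustrative assumptions only.

```python
import math

def chernoff_bound(t, gamma, eps, N_q):
    """Right-hand side of Lemma 2: (1 + gamma*eps/(10 t)) * N_q * exp(-gamma^2 eps/(20 t))."""
    return (1 + gamma * eps / (10 * t)) * N_q * math.exp(-gamma ** 2 * eps / (20 * t))

def segment_deviation(L, t_n, k, reward_sum):
    """Deviation gamma applied to a segment of length 2**k.

    `reward_sum` stands for the sum of the reward states of the arm.
    """
    return (math.sqrt(2) - 1) * math.sqrt(L * math.log(t_n) * 2 ** k) / reward_sum

# Illustrative (made-up) values: L = 50, t_n = 1000, k = 6, reward sum 3, gap 0.4.
L, t_n, k, reward_sum, eps = 50.0, 1000.0, 6, 3.0, 0.4
gamma = segment_deviation(L, t_n, k, reward_sum)
exp_factor = chernoff_bound(2 ** k, gamma, eps, 1.0) / (1 + gamma * eps / (10 * 2 ** k))
power_of_tn = t_n ** (-(3 - 2 * math.sqrt(2)) * L * eps / (20 * reward_sum ** 2))
```

Here `exp_factor` and `power_of_tn` agree up to floating-point error, which is the reason the geometric segment lengths do not appear in the exponent of the final bound.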



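Using Lemma 2, we have

\[ \Pr\!\left[ \sum_{s \in S_i} \left( s\, O_s^i(l) - s\, 2^{k_l} \pi_s^i \right) \ge \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}} \right] \]

\[ = \Pr\!\left[ \sum_{s \in S_i} \left( s\, O_s^i(l) - s\, 2^{k_l} \pi_s^i - \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}}\, \frac{s}{\sum_{s' \in S_i} s'} \right) \ge 0 \right] \tag{41} \]

\[ \le \sum_{s \in S_i} \Pr\!\left[ s\, O_s^i(l) - s\, 2^{k_l} \pi_s^i - \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}}\, \frac{s}{\sum_{s' \in S_i} s'} \ge 0 \right] \]

\[ = \sum_{s \in S_i} \Pr\!\left[ O_s^i(l) - 2^{k_l} \pi_s^i \ge \sqrt{L \log t_n}\,(\sqrt{2} - 1) \sqrt{2^{k_l}}\, \frac{1}{\sum_{s' \in S_i} s'} \right] \]

\[ \le |S_i| \left( 1 + \frac{(\sqrt{2} - 1)\, \epsilon_i \sqrt{L \log t_n}}{10 \sum_{s \in S_i} s\, \sqrt{2^{k_l}}} \right) N_q^i\, t_n^{-(3 - 2\sqrt{2}) L \epsilon_i / (20 (\sum_{s \in S_i} s)^2)}. \]

Thus we have

\[ \Pr\!\left[ w_i \bar{s}_i(t_n) \ge w_i \mu_i + \sqrt{L w_i \log t_n} \right] \le \sum_{l=1}^{K} |S_i| \left( 1 + \frac{(\sqrt{2} - 1)\, \epsilon_i \sqrt{L \log t_n}}{10 \sum_{s \in S_i} s\, \sqrt{2^{k_l}}} \right) N_q^i\, t_n^{-(3 - 2\sqrt{2}) L \epsilon_i / (20 (\sum_{s \in S_i} s)^2)} \]

\[ \le \left( K + \frac{\sqrt{2}\, \epsilon_i \sqrt{L \log t_n}}{10 \sum_{s \in S_i} s} \right) |S_i|\, N_q^i\, t_n^{-(3 - 2\sqrt{2}) L \epsilon_i / (20 (\sum_{s \in S_i} s)^2)} \]

\[ \le \left( \frac{\log t_n}{\log 2} + \frac{\sqrt{2}\, \epsilon_i \sqrt{L \log t_n}}{10 \sum_{s \in S_i} s} \right) |S_i|\, N_q^i\, t_n^{-(3 - 2\sqrt{2}) L \epsilon_i / (20 (\sum_{s \in S_i} s)^2)} \tag{53} \]

\[ \le \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_i \sqrt{L}}{10 \sum_{s \in S_i} s} \right) |S_i|\, N_q^i\, t_n^{1/2 - (3 - 2\sqrt{2}) (L \epsilon_i / (20 (\sum_{s \in S_i} s)^2))}, \tag{54} \]

where the second inequality uses $\sum_{l=1}^{K} 2^{-k_l/2} \le \sqrt{2}/(\sqrt{2} - 1)$ for distinct $k_l$, (53) follows from the fact that $K \le \log_2 t_n$, and (54) from $\log t_n \le \sqrt{t_n}$. Since the condition on L guarantees that $(3 - 2\sqrt{2}) L \epsilon_i / (20 (\sum_{s \in S_i} s)^2) \ge 3/2$, we arrive at

\[ \Pr[\bar{s}_i(t_n) \ge \mu_i + C_{t_n, w_i}] \le \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_i \sqrt{L}}{10 \sum_{s \in S_i} s} \right) |S_i|\, N_q^i\, t_n^{-1}. \tag{55} \]

Similarly, it can be shown that

\[ \Pr[\bar{s}_j(t_n) \le \mu_j - C_{t_n, w_j}] \le \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_j \sqrt{L}}{10 \sum_{s \in S_j} s} \right) |S_j|\, N_q^j\, t_n^{-1}. \tag{56} \]

Thus, using $N_q^k \le 1/\pi_{\min}$,

\[ \Pr[i, j, n] \le \left[ \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_i \sqrt{L}}{10 \sum_{s \in S_i} s} \right) |S_i| + \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_j \sqrt{L}}{10 \sum_{s \in S_j} s} \right) |S_j| \right] \frac{1}{\pi_{\min}}\, t_n^{-1}. \tag{57} \]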

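Thus the regret caused by engaging bad arms in the nth exploitation epoch is bounded by

\[ 2 \times 4^{n-1} \sum_{j=2}^{N} (\mu_{\sigma(1)} - \mu_{\sigma(j)}) \sum_{k=1,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|\, \frac{1}{\pi_{\min}}\, t_n^{-1}. \tag{58} \]

By (27) and $t_n \ge \tfrac{2}{3} 4^{n-1}$, summing the bound in (58) over the exploitation epochs gives

\[ 3 \left\lceil \log_4\!\left( \tfrac{3}{2}(t - N) + 1 \right) \right\rceil \frac{1}{\pi_{\min}} \sum_{j=2}^{N} (\mu_{\sigma(1)} - \mu_{\sigma(j)}) \sum_{k=1,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|. \tag{59} \]

Combining (24), (28), (30) and (59), we arrive at the upper bound of regret stated in the theorem.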

We point out that the same Chernoff bound given in Lemma 2 is also used in [6] to handle the
rested Markovian reward MAB problem. Note that the Chernoff bound in [30] requires that all the
observations used in calculating the sample means ($\bar{s}_i$ and $\bar{s}_j$ in (33)) are from a continuously
evolving Markov process. This condition is naturally satisfied in the rested MAB problem.
However, for the restless MAB problem considered here, the sample means are calculated using
observations from multiple epochs, which are noncontiguous segments of the Markovian sample
path. As detailed in the above proof, the desired bound on the probabilities of the events in (33)
is ensured by the carefully chosen (growing) lengths of the exploration and exploitation epochs.

Appendix B. Proof of Theorem 2 and Theorem 4

Recall that in Theorem 1 and Theorem 3, L and D are fixed a priori. Now we choose L(t) → ∞
as t → ∞ and D(t)/L(t) → ∞ as t → ∞. By the same reasoning as in the proof of Theorem 1, the
regret has three parts: the regret caused by arm switching, the regret caused by playing bad
arms in the exploration epochs, and the regret caused by playing bad arms in the exploitation
epochs. It will be shown that each part of the regret is on a lower order than or on the same
order as f(t) log t.

The number of arm switchings is upper bounded by $N \log_2(t/N + 1)$. So the regret caused
by arm switching is upper bounded by

\[ N \log_2(t/N + 1)\, A_{\max}. \tag{60} \]

Since f(t) → ∞ as t → ∞, we have

\[ \lim_{t \to \infty} \frac{N \log_2(t/N + 1)\, A_{\max}}{f(t) \log t} = 0. \tag{61} \]

Thus the regret caused by arm switching is on a lower order than f(t) log t.
The regret caused by playing bad arms in the exploration epochs is bounded by

\[ \tfrac{1}{3} \left[ 4(3D(t) \log t + 1) - 1 \right] \left( \frac{N}{M} \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{62} \]

Thus the regret caused by playing bad arms in the exploration epochs is on the same order as
f(t) log t.

For the regret caused by playing bad arms in the exploitation epochs, it is shown below that
the time spent on a bad arm i can be bounded by a constant independent of t. Since D(t)/L(t) → ∞
as t → ∞, there exists a time $t_1$ such that for all $t \ge t_1$, $D(t) \ge 4L(t) / \left( \min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)}) \right)^2$. There also exists
a time $t_2$ such that for all $t \ge t_2$, L(t) exceeds the lower bound on L required in Theorem 1 and Theorem 3. The time spent on playing bad arms before
$t_3 = \max(t_1, t_2)$ is at most $t_3$, and the caused regret is at most $\left( \sum_{i=1}^{M} \mu_{\sigma(i)} \right) t_3$. After $t_3$, the time
spent on each bad arm i is upper bounded by (following similar reasoning from (32) to (59))

7r 2 |St| + |<S ff (i)| ^ e nmx ^L(t^ 

2 TTmin 10s mm 



(63) 



An upper bound for the corresponding regret is 



T L L (a^j - ^)) 22 1^2 + ioe^ |lSfc t' (64) 

3=1 j=M+l fc=l,3 \ to ^s6S fc / mm 

which is a constant independent of time t. Thus the regret caused by playing bad arms in the
exploitation epochs is on a lower order than f(t) log t.

Because each part of the regret is on a lower order than or on the same order as f(t) log t,
the total regret is on the same order as f(t) log t.

Appendix C. Proof of Theorem 3
For the case of M > 1, we first rewrite the definition of regret as

\[ r_\Phi(t) = t \sum_{i=1}^{M} \mu_{\sigma(i)} - E_\Phi R(t) \tag{65} \]

\[ = \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\!\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] + \left[ t \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_i E[T_i(t)] \right]. \tag{66} \]

To show that the regret has a logarithmic order, it is sufficient to show that the two terms in
(66) have logarithmic orders. The first term in (66) can be considered as the regret caused by
the transient effect. The second term can be considered as the regret caused by engaging a bad arm.
Similar to what we have done for M = 1, we bound the regret caused by the transient effect based
on Lemma 1 and the upper bounds on the numbers of epochs in (26) and (27). A logarithmic upper
bound of the first term in (66) is

\[ \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\!\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] \le \left( M \left\lceil \log_4\!\left( \tfrac{3}{2}(t - N) + 1 \right) \right\rceil + N \left( \lfloor \log_4(3D \log t + 1) \rfloor + 1 \right) \right) A_{\max}. \tag{67} \]

Next we show that the second term of (66) has a logarithmic order by bounding the total time
spent on the bad arms. We first bound the time spent on the bad arms during the exploration
epochs. By (29), the regret caused by playing bad arms in the exploration epochs is at most

\[ \tfrac{1}{3} \left[ 4(3D \log t + 1) - 1 \right] \left( \frac{N}{M} \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{68} \]

Next, we bound the time spent on the bad arms during the exploitation epochs. By (57), the
regret caused by engaging bad arms in the nth exploitation epoch is bounded by

\[ 2 \times 4^{n-1} \sum_{j=1}^{M} \sum_{i=M+1}^{N} (\mu_{\sigma(j)} - \mu_{\sigma(i)}) \sum_{k=j,i} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|\, \frac{1}{\pi_{\min}}\, t_n^{-1}. \tag{69} \]

By (27) and $t_n \ge \tfrac{2}{3} 4^{n-1}$, summing the bound in (69) over the exploitation epochs gives

\[ 3 \left\lceil \log_4\!\left( \tfrac{3}{2}(t - N) + 1 \right) \right\rceil \frac{1}{\pi_{\min}} \sum_{j=1}^{M} \sum_{i=M+1}^{N} (\mu_{\sigma(j)} - \mu_{\sigma(i)}) \sum_{k=j,i} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|. \tag{70} \]

Combining (66), (67), (68) and (70), we arrive at the upper bound of regret given in (10).

Appendix D. Proof of Theorem 5
We first rewrite the definition of regret as

\[ r_\Phi(t) = t \sum_{i=1}^{M} \mu_{\sigma(i)} - E_\Phi R(t) \tag{71} \]

\[ = \sum_{i=1}^{N} \left[ \mu_i E[T_i(t)] - E\!\left[ \sum_{n=1}^{T_i(t)} s_i(t_i(n)) \right] \right] + \left[ t \sum_{i=1}^{M} \mu_{\sigma(i)} - \sum_{i=1}^{N} \mu_i E[T_i(t)] \right]. \tag{72} \]

Using Lemma 1, the first term in (72) can be bounded by the following quantity under the
endogenous restless model (it is zero under the exogenous model):

\[ \left( M \left\lceil \log_4\!\left( \tfrac{3t}{2M} + 1 \right) \right\rceil + N \left( \lfloor \log_4(3D \log t + 1) \rfloor + 1 \right) \right) M A_{\max}, \tag{73} \]

which has a logarithmic order.

We are going to show that the second term in (72) has a logarithmic order. It will be verified
by bounding the regret in both the exploitation and exploration epochs by a logarithmic order.

The upper bound on $T_O(t)$ in (29) still holds and consequently the regret caused by engaging
bad arms in the exploration epochs by time t is upper bounded by

\[ \tfrac{1}{3} \left[ 4(3D \log t + 1) - 1 \right] \left( N \sum_{i=1}^{M} \mu_{\sigma(i)} - M \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{74} \]

The second reason for regret in the second term of (72) is not playing the expected arms in
the exploitation epochs. If in the mth subepoch player k plays the $((m + k) \oslash M)$th best arm, then
every time the best M arms are played and there is no conflict. But arm $a^*_{(m+k) \oslash M}$ may not be
the $((m + k) \oslash M)$th best arm. Bounding the probabilities of mistakes can lead to an upper bound
on the regret caused in the exploitation epochs.

We adopt the same notations as in Appendix A. The upper bound on Pr[i, j, n] in (57) still
holds. Since different subepochs in the exploitation epochs are symmetric, the expected regret in
different subepochs is the same. In the first subepoch, player k aims at arm σ(k). In the model
where no player in conflict gets any reward, player k failing to identify arm σ(k) in the first
subepoch of the nth exploitation epoch can lead to a regret no more than $2 \sum_{m=1}^{M} \mu_{\sigma(m)} \times 4^{n-1}$.
In the model where players share the reward, player k failing to identify arm σ(k) in the first
subepoch of the nth exploitation epoch can lead to a regret no more than $2 \mu_{\sigma(k)} \times 4^{n-1}$. Thus
an upper bound for regret in the nth exploitation epoch for the no reward conflict model can be
obtained as

\[ 2M \times 4^{n-1} \sum_{m=1}^{M} \mu_{\sigma(m)} \sum_{j=1}^{M} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) \frac{|S_k|}{\pi_{\min}}\, t_n^{-1} \tag{75} \]

and an upper bound for regret in the nth exploitation epoch for the sharing reward model can be
obtained as

\[ 2M \times 4^{n-1} \sum_{j=1}^{M} \mu_{\sigma(j)} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) \frac{|S_k|}{\pi_{\min}}\, t_n^{-1}. \tag{76} \]

By time t, we have

\[ n_I(t) \le \left\lceil \log_4\!\left( \tfrac{3t}{2M} + 1 \right) \right\rceil. \tag{77} \]

From the upper bound on the number of the exploitation epochs given in (77), and also the fact
that $t_n \ge \tfrac{2}{3} 4^{n-1}$, we have the following upper bound on the regret caused in the exploitation epochs
under the no reward conflict model by time t (denoted by $r_{\Phi,I}(t)$):

\[ r_{\Phi,I}(t) \le 3M \left\lceil \log_4\!\left( \tfrac{3t}{2M} + 1 \right) \right\rceil \sum_{m=1}^{M} \mu_{\sigma(m)} \frac{1}{\pi_{\min}} \sum_{j=1}^{M} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k| \tag{78} \]

and the upper bound under the sharing reward conflict model

\[ r_{\Phi,I}(t) \le 3M \left\lceil \log_4\!\left( \tfrac{3t}{2M} + 1 \right) \right\rceil \frac{1}{\pi_{\min}} \sum_{j=1}^{M} \mu_{\sigma(j)} \sum_{i=1, i \ne j}^{N} \sum_{k=i,j} \left( \frac{1}{\log 2} + \frac{\sqrt{2}\, \epsilon_k \sqrt{L}}{10 \sum_{s \in S_k} s} \right) |S_k|. \tag{79} \]

Combining (72), (73), (74), (78) and (79), we arrive at the upper bounds of regret given in (18).

Appendix E. Proof of Theorem 6

We set L(t) → ∞ as t → ∞ and D(t)/L(t) → ∞ as t → ∞. The regret has three parts: the
transient effect of arms, the regret caused by playing bad arms in the exploration epochs, and
the regret caused by mistakes in the exploitation epochs. It will be shown that each part of the
regret is on a lower order than or on the same order as f(t) log t. The transient effect of arms is the
same as in Theorem 3. Thus it is upper bounded by a constant under the exogenous restless
model and on the order of log t under the endogenous restless model. Thus it is on a lower order
than f(t) log t.

The regret caused by playing bad arms in the exploration epochs is bounded by

\[ \tfrac{1}{3} \left[ 4(3D(t) \log t + 1) - 1 \right] \left( N \sum_{i=1}^{M} \mu_{\sigma(i)} - M \sum_{i=1}^{N} \mu_{\sigma(i)} \right). \tag{80} \]

Since D(t) = f(t), the regret in (80) is on the same order as f(t) log t.

For the regret caused by playing bad arms in the exploitation epochs, it is shown below that
the time spent on a bad arm i can be bounded by a constant independent of t.

Since D(t)/L(t) → ∞ as t → ∞, there exists a time $t_1$ such that for all $t \ge t_1$, $D(t) \ge 4L(t) / \left( \min_{j \le M} (\mu_{\sigma(j)} - \mu_{\sigma(j+1)}) \right)^2$.
There also exists a time $t_2$ such that for all $t \ge t_2$, L(t) exceeds the lower bound on L required in the fixed-parameter case. The time spent on playing bad
arms before $t_3 = \max(t_1, t_2)$ is at most $t_3$, and the time spent on playing bad arms after $t_3$ is
also bounded by a finite constant, which can be found in a similar manner as in Appendix B. Thus
the regret caused by mistakes in the exploitation epochs is on a lower order than f(t) log t.

Because each part of the regret is on a lower order than or on the same order as f(t) log t,
the total regret is on the same order as f(t) log t.

Appendix F. Proof of Theorem 7

At each time, regret incurs if one of the following three events happens: (i) at least one player 
incorrectly identifies the set of M best arms in the exploitation sequence, (ii) at least one player 
is exploring, (iii) at least one collision occurs among the players. In the following, we will bound
the expected number of the occurrences of these three events by the logarithmic order with time. 

We first consider events (i) and (ii). Define a singular slot as the time slot in which either (i) 
or (ii) occurs. Based on the previous theorems, the local expected number of learning mistakes 
at each player is bounded by the logarithmic order with time. Furthermore, the cardinality of
the local exploration sequence at each player is also bounded by the logarithmic order with time. 
We thus have that the expected number of singular slots is bounded by the logarithmic order 
with time, i.e., the expected number of the occurrences of events (i) and (ii) is bounded by the 
logarithmic order with time. 

To prove the theorem, it remains to show that the expected number of collisions in all non- 
singular slots is also bounded by the logarithmic order with time. Consider the contiguous period 
consisting of all slots between two successive singular slots. During this period, all players 
correctly identify the M best arms and a collision occurs if and only if at least two players 
choose the same arm. Due to the randomized arm selection after each collision, it is clear that, 
in this period, the expected number of collisions before all players are orthogonalized into the 
M best arms is bounded by a constant uniform over time. Since the expected number of such 
periods has the same order as the expected number of singular slots, the expected number of such
periods is bounded by the logarithmic order with time. The expected number of collisions over
all such periods is thus bounded by the logarithmic order with time, i.e., the expected number
of collisions in all non-singular slots is bounded by the logarithmic order with time. We have thus
proved the theorem.
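
The orthogonalization step used in this argument can be illustrated with a short simulation. The sketch below is a toy model under stated assumptions, not the authors' procedure: all M players are assumed to have correctly identified the same M best arms (labelled 0 to M−1), every player involved in a collision redraws uniformly at random, and players that did not collide keep their arm. The function name `orthogonalize` is hypothetical.

```python
import random

def orthogonalize(M, rng, max_slots=100_000):
    """Return (number of slots with a collision, final arm of each player)."""
    choice = [rng.randrange(M) for _ in range(M)]  # initial uniform selection
    for collisions in range(max_slots):
        load = [choice.count(a) for a in range(M)]  # players per arm
        if max(load) == 1:
            return collisions, choice  # one player per arm: orthogonalized
        # only players that collided draw again; the others keep their arm
        choice = [rng.randrange(M) if load[a] > 1 else a for a in choice]
    raise RuntimeError("no orthogonal assignment reached")
```

Averaging the first returned value over many runs, for example with `random.Random(seed)` for a range of seeds, gives an empirical estimate of the constant that bounds the expected number of collisions before orthogonalization.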

Appendix G. Transition Matrix for Simulation in Sec. IV

The transition matrix for arm 1 is

0.0401 0.0787 0.0188 0.0572 0.0531 0.0569 0.0491 0.0145 0.0583 0.0244 0.0195 0.0694 0.0654 0.0256 0.0656 0.0707 0.0809 0.0283 0.0322 0.0914
0.0787 0.0448 0.0677 0.0165 0.0674 0.0545 0.0537 0.0622 0.0653 0.0491 0.0163 0.0613 0.0679 0.0580 0.0216 0.0580 0.0042 0.0219 0.0650 0.0659
0.0188 0.0677 0.0885 0.0107 0.0518 0.0687 0.0243 0.0997 0.0562 0.0663 0.0674 0.0005 0.1048 0.0571 0.0562 0.0411 0.0125 0.0308 0.0593 0.0176
0.0572 0.0165 0.0107 0.0083 0.0520 0.0802 0.0310 0.0731 0.0967 0.0697 0.0773 0.0630 0.0222 0.0229 0.0910 0.0036 0.0925 0.0180 0.0049 0.1091
0.0531 0.0674 0.0518 0.0520 0.0025 0.0801 0.0935 0.0495 0.0076 0.0097 0.0318 0.1150 0.1095 0.0355 0.0664 0.0160 0.0449 0.0321 0.0748 0.0069
0.0569 0.0545 0.0687 0.0802 0.0801 0.0551 0.0741 0.0426 0.0085 0.0405 0.0642 0.0234 0.0055 0.0196 0.0744 0.0095 0.0804 0.0468 0.0559 0.0590
0.0491 0.0537 0.0243 0.0310 0.0935 0.0741 0.0003 0.0852 0.0199 0.0733 0.1019 0.0055 0.0491 0.0898 0.0580 0.1040 0.0531 0.0114 0.0193 0.0036
0.0145 0.0622 0.0997 0.0731 0.0495 0.0426 0.0852 0.0028 0.0722 0.0563 0.0690 0.0234 0.0237 0.0496 0.0626 0.0737 0.0442 0.0496 0.0344 0.0117
0.0583 0.0653 0.0562 0.0967 0.0076 0.0085 0.0199 0.0722 0.0189 0.0649 0.1013 0.0736 0.0413 0.0371 0.0376 0.0961 0.0072 0.0672 0.0507 0.0195
0.0244 0.0491 0.0663 0.0697 0.0097 0.0405 0.0733 0.0563 0.0649 0.1114 0.0887 0.0175 0.0031 0.0690 0.0260 0.0090 0.0596 0.1015 0.0405 0.0195
0.0195 0.0163 0.0674 0.0773 0.0318 0.0642 0.1019 0.0690 0.1013 0.0887 0.0747 0.0866 0.0428 0.0089 0.0152 0.0428 0.0287 0.0337 0.0083 0.0212
0.0694 0.0613 0.0005 0.0630 0.1150 0.0234 0.0055 0.0234 0.0736 0.0175 0.0866 0.0961 0.0235 0.0617 0.0261 0.1233 0.0238 0.0417 0.0177 0.0469
0.0654 0.0679 0.1048 0.0222 0.1095 0.0055 0.0491 0.0237 0.0413 0.0031 0.0428 0.0235 0.0611 0.0354 0.0705 0.0817 0.0815 0.0221 0.0590 0.0298
0.0256 0.0580 0.0571 0.0229 0.0355 0.0196 0.0898 0.0496 0.0371 0.0690 0.0089 0.0617 0.0354 0.0150 0.1057 0.0951 0.0401 0.1038 0.0466 0.0235
0.0656 0.0216 0.0562 0.0910 0.0664 0.0744 0.0580 0.0626 0.0376 0.0260 0.0152 0.0261 0.0705 0.1057 0.0657 0.0436 0.0088 0.0665 0.0035 0.0350
0.0707 0.0580 0.0411 0.0036 0.0160 0.0095 0.1040 0.0737 0.0961 0.0090 0.0428 0.1233 0.0817 0.0951 0.0436 0.0289 0.0080 0.0368 0.0400 0.0182
0.0809 0.0042 0.0125 0.0925 0.0449 0.0804 0.0531 0.0442 0.0072 0.0596 0.0287 0.0238 0.0815 0.0401 0.0088 0.0080 0.0110 0.0231 0.2495 0.0459
0.0283 0.0219 0.0308 0.0180 0.0321 0.0468 0.0114 0.0496 0.0672 0.1015 0.0337 0.0417 0.0221 0.1038 0.0665 0.0368 0.0231 0.1075 0.1164 0.0409
0.0322 0.0650 0.0593 0.0049 0.0748 0.0559 0.0193 0.0344 0.0507 0.0405 0.0083 0.0177 0.0590 0.0466 0.0035 0.0400 0.2495 0.1164 0.0103 0.0118
0.0914 0.0659 0.0176 0.1091 0.0069 0.0590 0.0036 0.0117 0.0195 0.0195 0.0212 0.0469 0.0298 0.0235 0.0350 0.0182 0.0459 0.0409 0.0118 0.3227


The transition matrix for arm 2 is


0.0266 0.0415 0.0248 0.0896 0.0596 0.0615 0.0847 0.0106 0.0175 0.0734 0.0361 0.0888 0.0906 0.0118 0.0829 0.0442 0.0542 0.0112 0.0439 0.0467
0.0415 0.0544 0.0473 0.0287 0.0852 0.0084 0.0224 0.0515 0.0696 0.0496 0.0397 0.0890 0.0919 0.0244 0.0427 0.0088 0.0808 0.0269 0.0845 0.0529
0.0248 0.0473 0.0164 0.0902 0.0705 0.0828 0.0828 0.0492 0.0346 0.0413 0.0637 0.0804 0.0078 0.0321 0.0798 0.0250 0.0757 0.0692 0.0115 0.0151
0.0896 0.0287 0.0902 0.0768 0.0505 0.0020 0.0499 0.0578 0.0023 0.0545 0.0633 0.0886 0.0395 0.0528 0.0389 0.0927 0.0429 0.0272 0.0365 0.0154
0.0596 0.0852 0.0705 0.0505 0.0840 0.0984 0.0415 0.0660 0.0336 0.0713 0.0017 0.0126 0.0660 0.0249 0.0341 0.0006 0.0903 0.0715 0.0263 0.0113
0.0615 0.0084 0.0828 0.0020 0.0984 0.0812 0.0611 0.0935 0.0379 0.0536 0.0779 0.0592 0.0713 0.0083 0.0052 0.0616 0.0347 0.0617 0.0110 0.0288
0.0847 0.0224 0.0828 0.0499 0.0415 0.0611 0.0638 0.0540 0.0362 0.0761 0.0192 0.0696 0.0350 0.0656 0.0162 0.0684 0.0126 0.0582 0.0690 0.0139
0.0106 0.0515 0.0492 0.0578 0.0660 0.0935 0.0540 0.0470 0.0404 0.0705 0.0865 0.0450 0.0264 0.0516 0.0119 0.0535 0.0694 0.0803 0.0165 0.0185
0.0175 0.0696 0.0346 0.0023 0.0336 0.0379 0.0362 0.0404 0.0758 0.0893 0.0655 0.0721 0.0842 0.0803 0.0595 0.0101 0.0311 0.0158 0.0845 0.0597
0.0734 0.0496 0.0413 0.0545 0.0713 0.0536 0.0761 0.0705 0.0893 0.0579 0.0394 0.0287 0.0564 0.0624 0.0566 0.0089 0.0536 0.0441 0.0050 0.0073
0.0361 0.0397 0.0637 0.0633 0.0017 0.0779 0.0192 0.0865 0.0655 0.0394 0.0548 0.0033 0.0200 0.0820 0.0081 0.1193 0.0826 0.0816 0.0243 0.0311
0.0888 0.0890 0.0804 0.0886 0.0126 0.0592 0.0696 0.0450 0.0721 0.0287 0.0033 0.0602 0.0345 0.0537 0.0197 0.0666 0.0065 0.0160 0.0602 0.0455
0.0906 0.0919 0.0078 0.0395 0.0660 0.0713 0.0350 0.0264 0.0842 0.0564 0.0200 0.0345 0.0449 0.0559 0.0525 0.0589 0.0433 0.0657 0.0115 0.0435
0.0118 0.0244 0.0321 0.0528 0.0249 0.0083 0.0656 0.0516 0.0803 0.0624 0.0820 0.0537 0.0559 0.0771 0.0108 0.0283 0.1178 0.0403 0.0674 0.0526
0.0829 0.0427 0.0798 0.0389 0.0341 0.0052 0.0162 0.0119 0.0595 0.0566 0.0081 0.0197 0.0525 0.0108 0.1268 0.0481 0.0719 0.1129 0.0020 0.1193
0.0442 0.0088 0.0250 0.0927 0.0006 0.0616 0.0684 0.0535 0.0101 0.0089 0.1193 0.0666 0.0589 0.0283 0.0481 0.0394 0.0645 0.1475 0.0375 0.0161
0.0542 0.0808 0.0757 0.0429 0.0903 0.0347 0.0126 0.0694 0.0311 0.0536 0.0826 0.0065 0.0433 0.1178 0.0719 0.0645 0.0287 0.0104 0.0020 0.0268
0.0112 0.0269 0.0692 0.0272 0.0715 0.0617 0.0582 0.0803 0.0158 0.0441 0.0816 0.0160 0.0657 0.0403 0.1129 0.1475 0.0104 0.0169 0.0344 0.0082
0.0439 0.0845 0.0115 0.0365 0.0263 0.0110 0.0690 0.0165 0.0845 0.0050 0.0243 0.0602 0.0115 0.0674 0.0020 0.0375 0.0020 0.0344 0.1773 0.1946
0.0467 0.0529 0.0151 0.0154 0.0113 0.0288 0.0139 0.0185 0.0597 0.0073 0.0311 0.0455 0.0435 0.0526 0.1193 0.0161 0.0268 0.0082 0.1946 0.1927

The transition matrix for arm 3 is



0.0512 0.0398 0.0186 0.1012 0.0880 0.0099 0.0250 0.0948 0.0468 0.0743 0.0810 0.0130 0.0508 0.0513 0.0932 0.0251 0.0051 0.0453 0.0071
0.0122 0.0250 0.0019 0.0156 0.0832 0.0578 0.0962 0.0830 0.0360 0.0522 0.0748 0.0140 0.0884 0.1053 0.0440 0.0094 0.0683 0.0078 0.0737
0.0250 0.0505 0.0590 0.0413 0.0409 0.0907 0.0745 0.0572 0.0256 0.0575 0.0321 0.0725 0.0151 0.0472 0.0051 0.0558 0.0908 0.0418 0.0777
0.0019 0.0590 0.0678 0.0932 0.0800 0.0179 0.0399 0.0393 0.0156 0.0990 0.0134 0.0293 0.0482 0.0440 0.0948 0.0406 0.0155 0.0985 0.0833
0.0156 0.0413 0.0932 0.0061 0.0507 0.0295 0.0588 0.0905 0.0783 0.0488 0.0275 0.0807 0.0171 0.0266 0.0669 0.0320 0.0771 0.0299 0.0281
0.0832 0.0409 0.0800 0.0507 0.0064 0.0550 0.0050 0.0462 0.0698 0.0204 0.0588 0.0313 0.0223 0.0766 0.0465 0.0512 0.0323 0.0767 0.0585
0.0578 0.0907 0.0179 0.0295 0.0550 0.0249 0.0120 0.0223 0.1130 0.0087 0.1138 0.0928 0.0281 0.0060 0.0811 0.0286 0.0814 0.0505 0.0761
0.0962 0.0745 0.0399 0.0588 0.0050 0.0120 0.0709 0.0565 0.0784 0.1050 0.0158 0.0043 0.0885 0.0108 0.0278 0.0529 0.0285 0.1036 0.0455
0.0830 0.0572 0.0393 0.0905 0.0462 0.0223 0.0565 0.0159 0.0607 0.0563 0.0359 0.0259 0.0479 0.0295 0.0455 0.0584 0.0515 0.0379 0.0447
0.0360 0.0256 0.0156 0.0783 0.0698 0.1130 0.0784 0.0607 0.0656 0.0453 0.0347 0.0105 0.0639 0.0236 0.0475 0.0415 0.0695 0.0499 0.0239
0.0522 0.0575 0.0990 0.0488 0.0204 0.0087 0.1050 0.0563 0.0453 0.0744 0.0559 0.0487 0.0183 0.0243 0.0067 0.0421 0.0165 0.0734 0.0723
0.0748 0.0321 0.0134 0.0275 0.0588 0.1138 0.0158 0.0359 0.0347 0.0559 0.0178 0.0155 0.1284 0.0064 0.0132 0.1452 0.0009 0.1132 0.0157
0.0140 0.0725 0.0293 0.0807 0.0313 0.0928 0.0043 0.0259 0.0105 0.0487 0.0155 0.0740 0.1503 0.0878 0.0476 0.0127 0.0679 0.0695 0.0517
0.0884 0.0151 0.0482 0.0171 0.0223 0.0281 0.0885 0.0479 0.0639 0.0183 0.1284 0.1503 0.0253 0.0078 0.0577 0.0046 0.0452 0.0366 0.0554
0.1053 0.0472 0.0440 0.0266 0.0766 0.0060 0.0108 0.0295 0.0236 0.0243 0.0064 0.0878 0.0078 0.0579 0.0753 0.0707 0.1428 0.0106 0.0957
0.0440 0.0051 0.0948 0.0669 0.0465 0.0811 0.0278 0.0455 0.0475 0.0067 0.0132 0.0476 0.0577 0.0753 0.0758 0.0487 0.0863 0.0113 0.0250
0.0094 0.0558 0.0406 0.0320 0.0512 0.0286 0.0529 0.0584 0.0415 0.0421 0.1452 0.0127 0.0046 0.0707 0.0487 0.1239 0.0639 0.0421 0.0507
0.0683 0.0908 0.0155 0.0771 0.0323 0.0814 0.0285 0.0515 0.0695 0.0165 0.0009 0.0679 0.0452 0.1428 0.0863 0.0639 0.0215 0.0200 0.0150
0.0078 0.0418 0.0985 0.0299 0.0767 0.0505 0.1036 0.0379 0.0499 0.0734 0.1132 0.0695 0.0366 0.0106 0.0113 0.0421 0.0200 0.0228 0.0586
0.0737 0.0777 0.0833 0.0281 0.0585 0.0761 0.0455 0.0447 0.0239 0.0723 0.0157 0.0517 0.0554 0.0957 0.0250 0.0507 0.0150 0.0586 0.0415



The transition matrix for arm 4 is 



0.0541 0.1087 0.0564 0.0311 0.0663 0.0134 0.0580 0.0345 0.0239 0.0808 0.0066 0.0038 0.0628 0.1039 0.0002 0.0138 0.0820 0.0606 0.1037 0.0354
0.1087 0.0879 0.0148 0.0524 0.0073 0.0300 0.0403 0.0335 0.0768 0.0253 0.0651 0.0916 0.0188 0.0305 0.0473 0.0201 0.0574 0.0742 0.0728 0.0453
0.0564 0.0148 0.0073 0.0739 0.0583 0.1106 0.0338 0.0907 0.0350 0.1033 0.0850 0.0352 0.0502 0.0254 0.0001 0.0534 0.0035 0.0307 0.0543 0.0782
0.0311 0.0524 0.0739 0.0546 0.0103 0.0771 0.0710 0.0422 0.0591 0.0650 0.0709 0.0875 0.0371 0.0019 0.0627 0.0593 0.0338 0.0454 0.0287 0.0362
0.0663 0.0073 0.0583 0.0103 0.0418 0.0822 0.0536 0.0701 0.0631 0.0506 0.0820 0.0090 0.0219 0.0802 0.0503 0.0813 0.0045 0.0793 0.0475 0.0406
0.0134 0.0300 0.1106 0.0771 0.0822 0.0139 0.0589 0.0044 0.0686 0.0496 0.0350 0.0764 0.0585 0.0368 0.0525 0.0445 0.0894 0.0301 0.0270 0.0410
0.0580 0.0403 0.0338 0.0710

The transition matrix for arm 5 is